pricingengine.estimation package¶

Submodules¶

pricingengine.estimation.double_ml module¶

class pricingengine.estimation.double_ml.DoubleML(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), treatment_builders=None, feature_builders=None, sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True)¶

Bases: pricingengine.estimation.double_ml.DoubleMLLikeModel

Generic Double ML Model. Estimates the coefficient \(\beta\) from the following partially linear model

\(Y = f(X) + \beta \cdot D + \epsilon\)

\(D = g(X) + \mu\)

Note that the base models are cross-fit across folds (so a model’s predictions for its training data are not used).
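The cross-fitting idea can be sketched without the library. The snippet below is a minimal illustration under stated assumptions, not the DoubleML implementation: it uses a single covariate, a plain least-squares line in place of LassoCV for both first-stage models, two shuffled folds, and simulated data. All names (fit_line, cross_fit_residuals, beta_hat) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def cross_fit_residuals(xs, target, folds):
    """Residualize target on xs; each row is predicted by a model
    that was fit on the other fold only."""
    resid = [0.0] * len(xs)
    for train, test in folds:
        a, b = fit_line([xs[i] for i in train], [target[i] for i in train])
        for i in test:
            resid[i] = target[i] - (a + b * xs[i])
    return resid


# Simulated data following the partially linear model (assumed values).
rng = random.Random(0)
n, beta = 2000, -2.0
X = [rng.uniform(0, 1) for _ in range(n)]
D = [1 + 2 * x + rng.gauss(0, 1) for x in X]                          # D = g(X) + mu
Y = [3 + 1.5 * x + beta * d + rng.gauss(0, 1) for x, d in zip(X, D)]  # Y = f(X) + beta*D + eps

# Two shuffled folds as (train_indices, test_indices) pairs.
idx = list(range(n))
rng.shuffle(idx)
half = n // 2
folds = [(idx[half:], idx[:half]), (idx[:half], idx[half:])]

# First stage: out-of-fold residuals for outcome and treatment.
y_res = cross_fit_residuals(X, Y, folds)
d_res = cross_fit_residuals(X, D, folds)

# Second stage: OLS without a constant of outcome residuals on treatment residuals.
beta_hat = sum(d * y for d, y in zip(d_res, y_res)) / sum(d * d for d in d_res)
```

Because every residual comes from a model that never saw that row, overfitting in the first stage does not leak into the estimate of \(\beta\).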

__init__(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), treatment_builders=None, feature_builders=None, sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True)¶

Initialize a new DoubleML instance.

Parameters:

schema – The expected schema of datasets that will be fit
baseline_model – Instance of a Model subclass used to fit the baseline treatment and outcome prediction models in the first stage regressions. May also be a dict mapping each column name (all treatment and outcome variables) to a corresponding Model.
causal_model (CausalModel) – Model to be used for computing treatment effects in second stage regression
error_model (Model) – Model to be used for estimating average (absolute) error size as a function of features (i.e. heteroskedasticity function)
feature_builders – List of VarBuilder objects used to create features for first stage regressions
treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
cluster_date – Bool (default True) indicating whether to cluster standard errors at the level of the date column
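To illustrate what clustering standard errors by date means, here is a minimal sketch for a single regressor with no constant. It is an assumption-laden illustration, not the library's computation: it applies the textbook cluster-robust sandwich formula without any finite-sample correction, and the function name clustered_se and the sample data are hypothetical.

```python
from collections import defaultdict


def clustered_se(d, y, clusters):
    """Slope and cluster-robust standard error for y = beta * d + e (no constant)."""
    sdd = sum(di * di for di in d)
    beta = sum(di * yi for di, yi in zip(d, y)) / sdd
    # Sum the score d_i * e_i within each cluster before squaring.
    scores = defaultdict(float)
    for di, yi, c in zip(d, y, clusters):
        scores[c] += di * (yi - beta * di)
    variance = sum(s * s for s in scores.values()) / sdd ** 2
    return beta, variance ** 0.5


d = [1.0, 2.0, 1.0, 2.0]
y = [2.0, 4.0, 3.0, 3.0]
dates = ["w1", "w1", "w2", "w2"]

beta_c, se_clustered = clustered_se(d, y, dates)
# Giving every row its own cluster reduces to heteroskedasticity-robust errors.
_, se_robust = clustered_se(d, y, range(len(d)))
```

Rows that share a date are allowed to have correlated errors, so their scores are summed before squaring.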

baseline_outcome_coefficients()¶: Return coefficients (averaged over splits) from first stage outcome regression. Will account for baseline feature scaling.

baseline_treatment_coefficients(treatment_name)¶: Get first stage coefficients (averaged over splits) from treatment regression corresponding to the given treatment_name. Will account for baseline feature scaling.

error_model¶: Return the model used to compute predicted (absolute) error size

fit_baseline_models_featurized(features, outcome, treatments, folds)¶

Fit first-stage baseline models (but not causal model) for predicting treatment and outcome. Sub-utility used by fit_baseline_models().

Parameters:

features – dictionary of features used for prediction (expects one for error too)
outcome – dictionary of leads mapping to series of outcome leads
treatments – double dictionary mapping from lead and treatment_name to series of treatment leads
folds – list of train test splits used for cross validation

static gen_prepredicted(df)¶: Converts a DataFrame of recorded predictions into a dictionary of PrePredicted models

static get_rec_df_from_csv(fname, schema)¶: Reads a csv file with recordings from a DoubleML prediction

Parameters:

fname – filename of csv of recorded model predictions
schema – Schema object

Returns: DataFrame of prediction recordings

outcome_baseline_models¶: Return the outcome baseline models

predict_baseline(features, folds=None)¶

Parameters:	features – Either a single feature matrix or a dictionary:varname->feature matrix folds –

treatment_baseline_models¶: Return the treatment baseline models

class pricingengine.estimation.double_ml.DoubleMLLikeModel(schema, causal_model, treatment_builders, feature_builders, sample_splitter, cluster_date=True, no_constant=False)¶

Bases: pricingengine.estimation.regression.Estimation

An abstract baseclass for DoubleML-like models (DoubleML/DynamicDML, etc.)

NO_SPLIT = 'no split'¶

TYPE_COL_NAME = 'type'¶

__init__(schema, causal_model, treatment_builders, feature_builders, sample_splitter, cluster_date=True, no_constant=False)¶

Parameters:

schema – The expected schema of datasets that will be fit
causal_model (CausalModel) – Model to be used for computing treatment effects in second stage regression
treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
feature_builders – List of VarBuilder objects used to create features for first stage regressions
sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
cluster_date – Bool (default True) indicating whether to cluster standard errors at the level of the date column
no_constant – Bool (default False) to force the construction of ConstVar treatments with all available interactions. If True, these constants are omitted.

baseline_fit_diagnostics()¶: Get various prediction diagnostics for all baseline (first stage) regressions

baseline_models_feat_info(avg_splits=False, combine_vars=False)¶

Return baseline model coefficients for all first stage models for the given lead

Parameters:

avg_splits (bool) – If True, averages diagnostics across model splits (otherwise returns them separately).
combine_vars – Try to combine the different variable vectors into a DataFrame (works if they share the same feature vector). If True, will return a single DataFrame. If False, will return a dict:varname->DataFrame (aggregated across leads).

causal_model¶: Return the causal model

fit_baseline_models(estimation_dataset)¶

Fit baseline (but not causal) models on DDML object

Parameters:	estimation_dataset – EstimationDataSet object on which baseline models are fit

fit_causal_model(estimation_dataset, rm_baseline_interm_info=False, subst_treatment_builders=None)¶

Fit only the causal model of DDML. Requires that you have already fit baseline models.

Parameters:

estimation_dataset –
rm_baseline_interm_info – If you want to fit several different causal models, pass in rm_baseline_interm_info=False
subst_treatment_builders – overwrites existing treatment_builders in case you want to try a different model

num_splits¶: Number of splits for cross-fitting

pricingengine.estimation.dynamic_dml module¶

class pricingengine.estimation.dynamic_dml.BaseAndError(leads)¶

Bases: object

Model that will fit and predict baseline models and error

__init__(leads)¶

baseline_fit_diagnostics()¶

baseline_models_feat_info(avg_splits=False, combine_vars=False)¶

error_model_predict(features_fit)¶

fit_baseline_models_featurized(common_features, lead_features, outcome_lead, treatments_lead, folds)¶

fit_error_model(features_fit, err)¶

static gen_prepredicted(df)¶

predict_baseline(common_features, lead_features, fold_fit_info)¶

class pricingengine.estimation.dynamic_dml.DDMLOptions(min_lead=1, max_lead=1)¶

Bases: object

Options for computing effects using Dynamic DoubleML

__init__(min_lead=1, max_lead=1)¶

Create a new DDMLOptions instance.

Parameters:

min_lead – Smallest lead to model
max_lead – Largest lead to model

leads¶: Return list where each element is a number of periods ahead to compute effects
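As a rough picture of how min_lead and max_lead could translate into the leads list, consider the stand-in class below. It is hypothetical and not the DDMLOptions source; in particular, treating both bounds as inclusive is an assumption.

```python
class LeadOptions:
    """Hypothetical stand-in for DDMLOptions; not the library class."""

    def __init__(self, min_lead=1, max_lead=1):
        if min_lead > max_lead:
            raise ValueError("min_lead must not exceed max_lead")
        self.min_lead = min_lead
        self.max_lead = max_lead

    @property
    def leads(self):
        # Assumption: both bounds are inclusive.
        return list(range(self.min_lead, self.max_lead + 1))


default_leads = LeadOptions().leads
three_leads = LeadOptions(min_lead=1, max_lead=3).leads
```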

class pricingengine.estimation.dynamic_dml.DynamicDML(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), feature_builders=None, treatment_builders=None, training_filter=None, options=DDMLOptions(1, 1), outcome_model_type='level', treatment_diff_models=[], sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True, cv_structure_fn=None, multi_task=False, no_constant=False)¶

Bases: pricingengine.estimation.double_ml.DoubleMLLikeModel

A series of DoubleML models. Each lead has a separate first stage model that forecasts the outcome at that lead. There is also a common causal_model, capturing the causal impacts of treatments, that is learned jointly from all models.

LEAD_LEVEL_NAME = 'lead'¶

__init__(schema, baseline_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), causal_model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), feature_builders=None, treatment_builders=None, training_filter=None, options=DDMLOptions(1, 1), outcome_model_type='level', treatment_diff_models=[], sample_splitter=KFold(n_splits=2, random_state=None, shuffle=True), cluster_date=True, cv_structure_fn=None, multi_task=False, no_constant=False)¶

Create a new instance of an effect model.

Parameters:

schema (Schema) – The schema for subsequent training and prediction data
baseline_model – Instance of a Model subclass used to fit the baseline treatment and outcome prediction models. May also be a dict mapping each column name to a corresponding Model.
causal_model (CausalModel) – Model to be used for computing treatment effects
error_model (Model) – Model to be used for estimating average (absolute) error size as a function of features (i.e. heteroskedasticity function)
feature_builders – List of VarBuilder objects used to create features for first stage regressions
treatment_builders – List of VarBuilder objects used to create treatments for second stage regressions
training_filter – function that takes the feature generator and estimation_dataset and returns a vector of bools which indicates which observations should be used for training. Default is all observations.
options (DDMLOptions) – Model options
outcome_model_type – FeatureGenerator.LEVEL_MODEL (default) trains first stage model in levels. FeatureGenerator.DIFF_MODEL trains first stage outcome model on first differences
treatment_diff_models – list of treatments that are estimated in first differences (Default is LEVEL)
sample_splitter – member of sklearn.model_selection used for sample splitting. Default is KFold.
cluster_date – Bool (default True) indicating whether to cluster standard errors at the level of the date column
cv_structure_fn – function that takes the df multiindex and returns labelled structure. Used by either GroupKFold or StratifiedKFold. Default is to use the time variable.
multi_task – Bool (default False) to indicate whether one instance of the specified baseline model will be used to make predictions for multiple leads. (Only certain models have this capability, for instance CNTKCausalModel). If False, then N copies of the model will be used to model the outcome for each lead.
no_constant – Bool (default False) to force the construction of ConstVar treatments with all available interactions. If True, these constants are omitted.
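The difference between a level model and a first-difference model (outcome_model_type, treatment_diff_models) comes down to the target each first stage sees. The helper below is a minimal sketch with a hypothetical name and data layout. It assumes every panel has consecutive periods and drops the first observation of each panel, which has no predecessor.

```python
from collections import defaultdict


def first_differences(rows):
    """rows: iterable of (panel, period, value) tuples.

    Returns a dict mapping (panel, period) to the value minus the previous
    period's value within the same panel.
    """
    by_panel = defaultdict(list)
    for panel, period, value in rows:
        by_panel[panel].append((period, value))
    diffs = {}
    for panel, obs in by_panel.items():
        obs.sort()  # order each panel by period
        for (_, prev), (period, cur) in zip(obs, obs[1:]):
            diffs[(panel, period)] = cur - prev
    return diffs


levels = [("a", 1, 10.0), ("a", 2, 12.0), ("a", 3, 11.0), ("b", 1, 5.0), ("b", 2, 5.5)]
diffed = first_differences(levels)
```

Differences are taken within each panel, so the last value of one panel is never subtracted from the first value of the next.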

static gen_prepredicted_baselines(df, base_error_class=None)¶: Converts a DataFrame of recorded predictions into a dictionary of PrePredicted models

get_design_matrices(dataset)¶: Gets all the design (and related) matrices from all the stages

Parameters:

dataset – Needs to have same schema as estimation dataset but can be much smaller

Returns: Tuple of baseline_variables, baseline_features, train_fold, causal_outcomes, causal_treatments.

The first two are nested dictionaries of lead->varname->data. The inner datasets of the first two, and train_fold, all have the same row index, so they can be concatenated. The final two datasets are those of the causal regression. The causal variables will be the original values (possibly scaled) rather than residuals. For the error model, query the first two with Model.ERROR_VAR_NAME (Note: folds are meaningless here).

get_diffed_vars()¶

get_marginal_effects(treatment_name, competition_col, leads=None, filter_dic=None)¶

static get_rec_df_from_csv(fname, schema)¶: Reads a csv file with recordings from a DynamicDML prediction

Parameters:

fname – filename of csv of recorded model predictions
schema – Schema object

Returns: DataFrame of prediction recordings

options¶: Return the options given during initialization

outcome_coefficients(lead)¶

Get first stage coefficients (averaged over splits) from outcome regression corresponding to the given lead. Will account for baseline feature scaling

Parameters:	lead – integer corresponding to preferred lead

static translate_prediction_to_rec(pred_df, date_col, exp_ind=True)¶: Takes back the targets according to lead (because in fitting they are lagged to the information date)

static translate_rec_to_prediction(rec_df, leads, date_col)¶: Advances the targets according to lead (in the fitting they were lagged to the information date) and then averages across the folds of the model.
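The two translate helpers move between the information date (when the prediction was made) and the target date (the period being predicted). The sketch below illustrates the advance-then-average direction only. The record layout, the weekly interval, and the function name are assumptions made for illustration, not the library's data format.

```python
from collections import defaultdict
from datetime import date, timedelta


def advance_and_average(records, interval=timedelta(weeks=1)):
    """records: iterable of (info_date, lead, fold, prediction) tuples.

    Shifts each record to its target date (info_date + lead * interval) and
    averages predictions across folds for each (target_date, lead) pair.
    """
    grouped = defaultdict(list)
    for info_date, lead, _fold, pred in records:
        grouped[(info_date + lead * interval, lead)].append(pred)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


recs = [
    (date(2020, 1, 6), 1, 0, 10.0),
    (date(2020, 1, 6), 1, 1, 12.0),
    (date(2020, 1, 6), 2, 0, 8.0),
]
by_target = advance_and_average(recs)
```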

treatment_coefficients(lead, treatment_name)¶

Get first stage coefficients (averaged over splits) from treatment regression corresponding to the given lead and treatment_name. Will account for baseline feature scaling

Parameters:

treatment_name (str) – name of treatment variable
lead (int) – preferred lead

class pricingengine.estimation.dynamic_dml.MultiTaskBaseAndError(schema, baseline_model, causal_model, error_model, n_splits, leads)¶

Bases: pricingengine.estimation.dynamic_dml.BaseAndError

__init__(schema, baseline_model, causal_model, error_model, n_splits, leads)¶

baseline_fit_diagnostics()¶

baseline_models_feat_info(avg_splits=False, combine_vars=False)¶

error_model_predict(features_fit)¶

fit_baseline_models_featurized(common_features, lead_features, outcome_lead, treatments_lead, folds)¶

Parameters:

common_features – features common to all leads
lead_features – lead-specific features
outcome_lead – dict mapping lead to outcome variable values
treatments_lead – dict mapping lead to treatment variable values
folds – list of train test splits used for cross validation

fit_error_model(features_fit, err)¶

static gen_prepredicted(df)¶

outcome_coefficients(lead)¶

predict_baseline(common_features, lead_features, fold_fit_info)¶

Parameters:

common_features – features common to all leads
lead_features – lead-specific features
fold_fit_info – same data format as folds variable in fit_baseline_models_featurized

treatment_coefficients(lead, treatment_name)¶

class pricingengine.estimation.dynamic_dml.N_Split(n_splits)¶

Bases: object

__init__(n_splits)¶

pricingengine.estimation.estimation_dataset module¶

class pricingengine.estimation.estimation_dataset.EstimationDataSet(data, schema, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}), fold_fit_info=None)¶

Bases: pricingengine.estimation.typed_dataset.TypedDataSet

Dataset with known schema used for generating features

__init__(data, schema, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}), fold_fit_info=None)¶

Parameters:

data – pandas dataframe containing date, units, and price columns with a single column index
schema – Schema describing the data
validators – A list of validators to use for verifying data integrity
fold_fit_info – Series where each value is the index of the model that has this as the test portion or NaN if all folds can have this as test (when fit on subset of dataset). This is None if this dataset has been fit. Will be set after fit. We store this rather than folds since we can filter more easily.
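The fold_fit_info encoding can be pictured with plain lists. The sketch below is an illustration under assumptions, not the library's code: folds are (train, test) pairs of row positions, the result is a list rather than a pandas Series, and both function names are hypothetical.

```python
import math


def folds_to_fit_info(folds, n_rows):
    """For each row, record the index of the model that holds it out as test.

    Rows that appear in no test set stay NaN, meaning any model may predict them.
    """
    info = [math.nan] * n_rows
    for model_idx, (_train, test) in enumerate(folds):
        for row in test:
            info[row] = model_idx
    return info


def fit_info_to_test_sets(info, n_models):
    """Recover, per model, the rows it is allowed to predict."""
    tests = [[] for _ in range(n_models)]
    for row, model_idx in enumerate(info):
        if math.isnan(model_idx):
            for rows in tests:  # unseen by every model, so usable by all
                rows.append(row)
        else:
            tests[int(model_idx)].append(row)
    return tests


folds = [([2, 3], [0, 1]), ([0, 1], [2, 3])]
info = folds_to_fit_info(folds, n_rows=5)  # row 4 was not part of the fit
test_sets = fit_info_to_test_sets(info, n_models=2)
```

Storing one value per row, rather than the folds themselves, means the fold assignment survives row filtering: dropping rows just drops the matching entries.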

append_data_one_instance(panel_dic, treatments_path, start_date)¶

Returns a new estimation_dataset object with additional rows corresponding to the instance specified in panel_dic and the given treatments_path. The synthetic data will begin on the start_date and carry forward at the same intervals as the rest of the data. If necessary, it will overwrite pre-existing data.

Parameters:

panel_dic – dictionary of panel values that must specify a unique instance
treatments_path – dictionary (keyed by treatment_names) with values as iterables of numbers specifying the planned treatments of that instance week-by-week going forward. The 0th value of each iterable corresponds to the start_date
start_date – First week in which the treatments_path is applied, i.e. the 0th value of each iterable in treatments_path is the treatment on the start_date. This value must be in the estimation_dataset or immediately following an observation in the estimation_dataset.
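The shape of the three arguments is easiest to see by expanding them into rows. The helper below is a hypothetical sketch, not append_data_one_instance itself: it assumes weekly spacing and a date column named "date", and it only builds the synthetic rows without merging them into a dataset or overwriting anything.

```python
from datetime import date, timedelta


def build_path_rows(panel_dic, treatments_path, start_date, interval=timedelta(weeks=1)):
    """Expand planned treatment paths into one row per period."""
    paths = {name: list(values) for name, values in treatments_path.items()}
    lengths = {len(values) for values in paths.values()}
    if len(lengths) != 1:
        raise ValueError("all treatment paths must have the same length")
    n_periods = lengths.pop()
    rows = []
    for step in range(n_periods):
        row = dict(panel_dic)                      # panel values identify the instance
        row["date"] = start_date + step * interval  # step 0 falls on start_date
        for name, values in paths.items():
            row[name] = values[step]
        rows.append(row)
    return rows


path_rows = build_path_rows({"item": "A"}, {"price": [1.0, 1.5]}, date(2020, 1, 6))
```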

static convert_folds_across_indexes(orig_folds, orig_idx, new_idx)¶: Converts fold info from one DataFrame index to another

data¶: Returns data

data_interval¶: Return the temporal spacing between consecutive data points

filter(filter_dic=None, first_date=None, last_date=None)¶

Returns a new estimation_dataset object which is filtered by the requirements in the filter_dic

Parameters:

filter_dic – dictionary mapping data columns to lists of allowed values
first_date – omit any data before this date
last_date – omit any data from after this date
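The filtering semantics described above can be sketched on a plain list of dict rows. This is an illustration only: the real method operates on an EstimationDataSet, and the function name and the "date" column name are assumptions.

```python
from datetime import date


def filter_rows(rows, filter_dic=None, first_date=None, last_date=None, date_col="date"):
    """Keep rows whose columns hold allowed values and whose date is in range."""
    filter_dic = filter_dic or {}
    kept = []
    for row in rows:
        if any(row[col] not in allowed for col, allowed in filter_dic.items()):
            continue
        if first_date is not None and row[date_col] < first_date:
            continue  # before first_date
        if last_date is not None and row[date_col] > last_date:
            continue  # after last_date
        kept.append(row)
    return kept


sample = [
    {"item": "A", "date": date(2020, 1, 6)},
    {"item": "B", "date": date(2020, 1, 6)},
    {"item": "A", "date": date(2020, 1, 13)},
    {"item": "A", "date": date(2020, 1, 20)},
]
only_a_early = filter_rows(sample, {"item": ["A"]}, last_date=date(2020, 1, 13))
```

Both date bounds are inclusive here, matching "omit any data before" and "omit any data from after".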

fold_fit_info¶: Returns the folds (test part at least) for what was fit

static from_df(df, treatment_colname='treatment', outcome_colname='units', date_colname='date', is_panel_col=<function EstimationDataSet.<lambda>>, validators=frozenset({<pricingengine.estimation.estimation_dataset.ValidPanels object>}))¶

Create an EstimationDataSet from the given dataframe.

Parameters:

df – A pandas dataframe containing price, units, and date columns.
- String columns and panel columns will be interpreted as categorical columns
- Float/int columns will be interpreted as numeric columns (convert numeric columns to string if the column is to be interpreted as categorical)
treatment_colname – The name of the numeric column containing treatment
outcome_colname – The name of the numeric column containing quantities
date_colname – The name of the datetime column containing dates
is_panel_col – A function that takes in a column name and returns a boolean indicating whether the column is used to break the dataset into panels
validators – list of validators applied to the produced EstimationDataSet

gen_folds_for_new_index(new_idx)¶: Converts fold info from this object’s index to another

schema¶: Returns schema

set_folds_from_other_index(other_folds, other_idx)¶: Sets this object’s fold info to that from another context (fold_info and index)

pricingengine.estimation.regression module¶

class pricingengine.estimation.regression.Estimation(schema, cluster_date)¶

Bases: object

__init__(schema, cluster_date)¶

fit(estimation_dataset)¶

Fit baseline and causal models on the given dataset

Parameters:	estimation_dataset (EstimationDataSet) – A dataset on which to train the model

get_coefficients(human_index=True)¶

Get coefficients from the causal model

Parameters:	human_index – If True, then the interaction levels of the multiindex are squashed. Otherwise, they are left separate (useful for automated post-processing).

get_standard_errors(human_index=True)¶: Get standard errors from the causal model

get_variance_matrix(human_index=True)¶: Get variance matrix from the causal model

predict(dataset, ret_pred=None)¶

Compute predictions for the given dataset using previously trained model

Parameters:

dataset (EstimationDataSet) – A dataset containing features from which to generate predictions. The schema of the dataset must match the schema of the dataset used to fit the model.
ret_pred – Pass in an empty dataframe if you want that dataframe to be populated with predictions of the first stage models

Raises:

ValueError – If the schema of the given dataset does not match the schema given for initialization
RuntimeError – If the model has not yet been fit

class pricingengine.estimation.regression.Regression(schema, model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), regressor_builders=None, cluster_date=True)¶

Bases: pricingengine.estimation.regression.Estimation

Class for implementing estimation with VarBuilders

__init__(schema, model=OLS(add_const=False), error_model=LassoCV: LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False, precompute='auto', random_state=None, selection='cyclic', tol=0.0001, verbose=False), regressor_builders=None, cluster_date=True)¶

Initialize a new Regression instance.

Parameters:

schema – The expected schema of datasets to be fit and transformed
model (Model) – used for estimation
error_model (Model) – model used to estimate average abs error (i.e. heteroskedasticity function)
regressor_builders – List of VarBuilders used to create regressors

error_model¶: Return the error model

model¶: Return the causal model

pricingengine.estimation.typed_dataset module¶

class pricingengine.estimation.typed_dataset.ColType¶

Bases: enum.Enum

Input IDs for known pricing data column content

OUTCOME = 10¶

OUTCOME_RESIDUAL = 12¶

PREDETERMINED = 13¶

A description of the data contained in a single column

The column tagged as ColType.ITEM must have DataType.CATEGORICAL
The column tagged as ColType.OUTCOME must have DataType.NUMERIC
The column tagged as ColType.TREATMENT must have DataType.NUMERIC

TREATMENT = 9¶

TREATMENT_RESIDUAL = 11¶

class pricingengine.estimation.typed_dataset.TypedDataSet(data, schema, required_types)¶

Bases: pricingengine.dataset.DataSet

Dataset class

__init__(data, schema, required_types)¶

Initializes a new instance of the DataSet class. The DataSet class combines time series data with a schema that specifies the column meta-data for the given time series data.

The given data-schema pair needs to adhere to the following expectations:

Each column defined in the given schema must be contained in the corresponding given time series data
Each column must have a data type corresponding to its schema DataType as follows:
- DataType.NUMERIC: integer or floating-point
- DataType.DATE_TIME: datetime
- DataType.CATEGORICAL: string or integer
In the specified schema, the name of the column with id ITEM must also be included in the list of panel column names

Parameters:

data – The time series data to be used for computing effects
schema – The schema specifying the meta-data for the time series

group_labels¶: A list parallel to the rows of the dataset with a label for each row. The labels can be passed to a Pandas groupby call to group data using known groups.
